Diagnostic tests for OLS

MACS 30200
University of Chicago

May 3, 2017

Assumptions of linear regression

\[Y_i = \alpha + \beta x_i + \epsilon_i\]

Linearity

\[E(\epsilon_i) \equiv E(\epsilon | x_i) = 0\]

Constant variance

\[V(\epsilon | x_i) = \sigma_{\epsilon}^2\]

Normality

\[\epsilon \sim N(0, \sigma_\epsilon^2)\]

Independence

  • Any pair of errors \(\epsilon_i\) and \(\epsilon_j\) are independent for \(i \neq j\)

\(X\)

  • \(X\) is assumed to be fixed
  • \(X\) is measured without error and independent of the error

    \[\epsilon_i \sim N(0, \sigma_\epsilon^2), \text{for } i = 1, \dots, n\]

\(X\) is not invariant

Handling violations of assumptions

  • Makes inference difficult
  • Move to robust inferential method
    • Nonparametric regression
    • Decision trees
    • Support vector machines
  • Diagnose assumption violations and impose solutions within linear regression framework

Unusual and influential data

  • Outlier
  • Leverage
  • Discrepancy
  • Influence

    \[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]

Unusual and influential data

Measuring leverage

\[h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i'=1}^{n} (X_{i'} - \bar{X})^2}\]

\[h_i = \mathbf{X}_i (\mathbf{X'X})^{-1} \mathbf{X'}_i\]

  • It is solely a function of \(X\)
  • Larger values indicate higher leverage
  • \(\frac{1}{n} \leq h_i \leq 1\)
  • \(\bar{h} = \frac{(p + 1)}{n}\)

Measuring discrepancy

  • Residuals

    \[V(E_i) = \sigma_\epsilon^2 (1 - h_i)\]

  • Standardized residuals

    \[E'_i \equiv \frac{E_i}{S_{E} \sqrt{1 - h_i}}\]

    \[S_E = \sqrt{\frac{E_i^2}{(n - k - 1)}}\]

  • Studentized residuals

    \[E_i^{\ast} \equiv \frac{E_i}{S_{E(-i)} \sqrt{1 - h_i}}\]

Measuring influence

  • \(\text{DFBETA}_{ij}\)

    \[D_{ij} = \hat{\beta_j} - \hat{\beta}_{j(-i)}, \text{for } i=1, \dots, n \text{ and } j = 0, \dots, k\]

  • \(\text{DFBETAS}_{ij}\)

    \[D^{\ast}_{ij} = \frac{D_{ij}}{SE_{-i}(\beta_j)}\]

  • Cook’s \(D\)

    \[D_i = \frac{E^{'2}_i}{k + 1} \times \frac{h_i}{1 - h_i}\]

Visualizing outliers

  • Outcome of interest - number of federal laws struck down by SCOTUS
  1. Age - the mean age of the members of the Supreme Court
  2. Tenure - mean tenure of the members of the Court
  3. Unified - a dummy variable indicating whether or not the Congress was controlled by the same party in that period

Visualizing outliers

##          term estimate std.error statistic  p.value
## 1 (Intercept) -12.1034    2.5432     -4.76 6.57e-06
## 2         age   0.2189    0.0448      4.88 4.01e-06
## 3      tenure  -0.0669    0.0643     -1.04 3.00e-01
## 4     unified   0.7176    0.4584      1.57 1.21e-01

Visualizing outliers

Visualizing outliers

Hat values

Anything exceeding twice the average \(\bar{h}\) is noteworthy

## # A tibble: 9 × 10
##   Congress congress nulls   age tenure unified  year    hat student
##      <chr>    <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>
## 1      1st        1     0  49.8    0.8       1  1789 0.0974   0.330
## 2      3rd        3     0  52.8    4.2       0  1793 0.1132   0.511
## 3     12th       12     0  49.0    6.6       1  1811 0.0802   0.669
## 4     17th       17     0  59.0   16.6       1  1821 0.0887  -0.253
## 5     20th       20     0  61.7   17.4       1  1827 0.0790  -0.577
## 6     23rd       23     0  64.0   18.4       1  1833 0.0819  -0.844
## 7     34th       34     0  64.0   14.6       0  1855 0.0782  -0.561
## 8     36th       36     0  68.7   17.8       0  1859 0.1020  -1.072
## 9     99th       99     3  71.9   16.7       0  1985 0.0912   0.295
## # ... with 1 more variables: cooksd <dbl>

Studentized residuals

Anything outside of the range \([-2,2]\) is discrepant

## # A tibble: 7 × 10
##   Congress congress nulls   age tenure unified  year    hat student cooksd
##      <chr>    <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>
## 1     67th       67     6  66.0    9.0       1  1921 0.0361    2.14 0.0415
## 2     74th       74    10  71.1   14.2       1  1935 0.0514    4.42 0.2229
## 3     90th       90     6  64.7   13.3       1  1967 0.0195    2.49 0.0292
## 4     91st       91     6  65.1   13.0       1  1969 0.0189    2.42 0.0269
## 5     92nd       92     5  62.0    9.2       1  1971 0.0146    2.05 0.0150
## 6     98th       98     7  69.9   14.7       0  1983 0.0730    3.02 0.1655
## 7    104th      104     8  60.6   12.5       1  1995 0.0208    4.48 0.0897

Influence

\[D_i > \frac{4}{n - k - 1}\]

## # A tibble: 4 × 10
##   Congress congress nulls   age tenure unified  year    hat student cooksd
##      <chr>    <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>
## 1     67th       67     6  66.0    9.0       1  1921 0.0361    2.14 0.0415
## 2     74th       74    10  71.1   14.2       1  1935 0.0514    4.42 0.2229
## 3     98th       98     7  69.9   14.7       0  1983 0.0730    3.02 0.1655
## 4    104th      104     8  60.6   12.5       1  1995 0.0208    4.48 0.0897

How to treat unusual observations

  • Mistake
  • Weird observation
    • Something unusual happened to that observation
    • No apparent reason
  • To drop or not to drop

How to treat unusual observations

## [1] 0.232
## [1] 0.258
## [1] 1.68
## [1] 1.29

Non-normally distributed errors

\[\epsilon \sim N(0, \sigma_\epsilon^2)\]

Quantile comparison plot

## # A tibble: 3,997 × 4
##      age    sex compositeHourlyWages yearsEducation
##    <int>  <chr>                <dbl>          <int>
##  1    40   Male                10.56             15
##  2    19   Male                11.00             13
##  3    46   Male                17.76             14
##  4    50 Female                14.00             16
##  5    31   Male                 8.20             15
##  6    30 Female                16.97             13
##  7    61 Female                 6.70             12
##  8    46 Female                14.00             14
##  9    43   Male                19.20             18
## 10    17   Male                 7.25             11
## # ... with 3,987 more rows
##             term estimate std.error statistic   p.value
## 1    (Intercept)   -8.124   0.59898     -13.6  5.27e-41
## 2        sexMale    3.474   0.20701      16.8  4.04e-61
## 3 yearsEducation    0.930   0.03426      27.1 5.47e-149
## 4            age    0.261   0.00866      30.2 3.42e-180

Density plot

Fixing non-normally distributed errors

  • Power and log transformations
##             term estimate std.error statistic   p.value
## 1    (Intercept)   1.0990  0.037965      28.9 1.97e-167
## 2        sexMale   0.2245  0.013121      17.1  2.16e-63
## 3 yearsEducation   0.0559  0.002171      25.7 2.95e-135
## 4            age   0.0182  0.000549      33.1 4.50e-212

Non-constant error variance

\[\text{Var}(\epsilon_i) = \sigma^2\]

  • Homoscedasticity
  • Heteroscedasticity

Detecting heteroscedasticity

Detecting heteroscedasticity

Breusch-Pagan test

  • Estimate an OLS model and obtain the squared residuals \(\hat{\epsilon}^2\)
  • Regress \(\hat{\epsilon}^2\) against all the \(k\) variables you think might be causing the heteroscedasticity
  • Calculate (\(n R^2_{\hat{\epsilon}^2}\))
    • \(\chi^2_{(k-1)}\) distribution
    • Rejecting the null hypothesis indicates heteroscedasticity is present

Breusch-Pagan test

## 
##  studentized Breusch-Pagan test
## 
## data:  sim_homo_mod
## BP = 3, df = 1, p-value = 0.08
## 
##  studentized Breusch-Pagan test
## 
## data:  sim_hetero_mod
## BP = 100, df = 1, p-value <2e-16

Weighted least squares regression

\[ \begin{bmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \sigma_n^2 \\ \end{bmatrix} \]

Weighted least squares regression

\[ \mathbf{W} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_2^2} & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_n^2} \\ \end{bmatrix} \]

Weighted least squares regression

\[\hat{\mathbf{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}\]

\[\hat{\mathbf{\beta}} = (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} \mathbf{y}\]

\[\sigma_{\epsilon}^2 = \frac{\sum(w_i E_i^2)}{n}\]

Estimating the weights

  1. Use the residuals from a preliminary OLS regression to obtain estimates of the error variance
  2. Model the weights as a function of observable variables in the model

Estimating the weights

##             term estimate std.error statistic   p.value
## 1    (Intercept)   -8.124   0.59898     -13.6  5.27e-41
## 2        sexMale    3.474   0.20701      16.8  4.04e-61
## 3 yearsEducation    0.930   0.03426      27.1 5.47e-149
## 4            age    0.261   0.00866      30.2 3.42e-180
##             term estimate std.error statistic p.value
## 1    (Intercept)   -8.099   0.01601      -506       0
## 2        sexMale    3.474   0.00977       356       0
## 3 yearsEducation    0.927   0.00142       653       0
## 4            age    0.261   0.00017      1534       0

Corrections for the variance-covariance estimates

  • Standard errors
  • Huber-White standard errors

Corrections for the variance-covariance estimates

##             term estimate std.error statistic   p.value std.error.rob
## 1    (Intercept)   -8.124   0.59898     -13.6  5.27e-41       0.63615
## 2        sexMale    3.474   0.20701      16.8  4.04e-61       0.20725
## 3 yearsEducation    0.930   0.03426      27.1 5.47e-149       0.03849
## 4            age    0.261   0.00866      30.2 3.42e-180       0.00881

Non-linearity in the data

\[\epsilon \sim N(0, \sigma_\epsilon^2)\]

  • The relationship between \(X_1\) and \(Y\) is nonlinear
  • The relationship between \(X_1\) and \(Y\) is conditional on \(X_2\)

Partial residual plots

\[E_i^{(j)} = E_i + B_j X_{ij}\]

Partial residual plots

Correcting for nonlinearity

\[\log(\text{Wage}) = \beta_0 + \beta_1(\text{Male}) + \beta_2 \text{Age} + \beta_3 \text{Age}^2 + \beta_4 \text{Education}^2\]

##                  term  estimate std.error statistic   p.value
## 1         (Intercept)  0.396819  5.78e-02      6.87  7.62e-12
## 2             sexMale  0.221458  1.24e-02     17.79  3.21e-68
## 3 I(yearsEducation^2)  0.001805  7.86e-05     22.96 1.19e-109
## 4                 age  0.083018  3.19e-03     26.05 2.93e-138
## 5            I(age^2) -0.000852  4.10e-05    -20.78  3.85e-91

Correcting for nonlinearity

  • Partial residuals

    \[E_i^{\text{Age}} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i + E_i\]

    \[E_i^{\text{Education}} = 0.002 \times \text{Education}^2_i + E_i\]

  • Partial fits

    \[\hat{Y}_i^{(\text{Age})} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i\]

    \[\hat{Y}_i^{(\text{Education})} = 0.002 \times \text{Education}^2_i\]

Correcting for nonlinearity

Collinearity

Where explanatory variables are correlated with one another

Perfect collinearity

## 
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.403 -1.403 -0.495  1.339  6.072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.10768    2.84243   14.46  1.6e-14 ***
## disp         0.00747    0.01184    0.63   0.5332    
## wt          -3.63568    1.04014   -3.50   0.0016 ** 
## cyl         -1.78494    0.60711   -2.94   0.0065 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.815 
## F-statistic: 46.4 on 3 and 28 DF,  p-value: 5.4e-11

Perfect collinearity

## 
## Call:
## lm(formula = mpg ~ disp + wt + cyl + disp_mean, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.403 -1.403 -0.495  1.339  6.072 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.10768    2.84243   14.46  1.6e-14 ***
## disp         0.00747    0.01184    0.63   0.5332    
## wt          -3.63568    1.04014   -3.50   0.0016 ** 
## cyl         -1.78494    0.60711   -2.94   0.0065 ** 
## disp_mean         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.815 
## F-statistic: 46.4 on 3 and 28 DF,  p-value: 5.4e-11

Perfect collinearity

Less-than-perfect collinearity

Less-than-perfect collinearity

##          term estimate std.error statistic   p.value
## 1 (Intercept) -173.411  43.82839     -3.96  9.01e-05
## 2         age   -2.291   0.67248     -3.41  7.23e-04
## 3       limit    0.173   0.00503     34.50 1.63e-121

Less-than-perfect collinearity

##          term  estimate std.error statistic  p.value
## 1 (Intercept) -377.5368   45.2542    -8.343 1.21e-15
## 2       limit    0.0245    0.0638     0.384 7.01e-01
## 3      rating    2.2017    0.9523     2.312 2.13e-02

Less-than-perfect collinearity

Correlation matrix

Scatterplot matrix

Here it is very clear that limit and rating are strongly correlated with one another.

Variance inflation factor (VIF)

  • Ratio of the variance of \(\hat{\beta}_j\) when fitting the full model divided by the variance of \(\hat{\beta}_j\) if fit on its own model
  • Rule of thumb - greater than 10

Variance inflation factor (VIF)

##   age limit 
##  1.01  1.01
##  limit rating 
##    160    160

Fixing multicollinearity

  • Drop one or more of the collinear variables from the model - NO!
  • Add more data
  • Transform the variables
  • Shrinkage methods